Abstract

We propose a method to efficiently construct data-dependent kernels which can make use of large quantities
of (unlabeled) data. Our construction makes an approximation in the standard construction of semi-supervised
kernels in Sindhwani et al. (2005). In typical cases these kernels can be computed in nearly-linear time (in the
amount of data), improving on the cubic time of the standard construction and enabling large-scale semi-supervised
learning in a variety of contexts. The methods are validated on semi-supervised and unsupervised problems on
data sets containing up to 64'000 sample points.



1 Introduction 



Semi-supervised methods of inference aim to utilize a large quantity of unlabeled data to assist the learning process.
Often this is achieved by using the data to define a data-dependent kernel which captures the geometry of the data
distribution, as revealed by the sample. The norm in the reproducing kernel Hilbert space (r.k.h.s.) associated to
such a kernel typically includes a data-dependent "intrinsic regularizer" component which captures the smoothness
of functions on the data sample. Associated kernel methods such as LapSVM (Belkin et al. 2006) have been shown
to achieve state-of-the-art performance in classification.



A drawback of the standard semi-supervised kernel construction, due to Sindhwani et al. (2005), is its large
computational cost, which is cubic in the number of (unlabeled) data points, rendering the method infeasible for
even moderately-sized problems. Several solutions to this problem have been proposed; most apply to particular
algorithms only (Zhu and Lafferty 2005; Collobert et al. 2006; Garcke and Griebel 2005; Tsang and Kwok 2006;
Sindhwani and Keerthi 2006; Melacci and Belkin 2011) or are restricted to the special case of transduction
(Mahdaviani et al. 2005). In contrast, we provide efficiently computable data-dependent kernels which can be used
in any kernel method.

The kernels we study in this work are obtained by making an approximation in the standard construction of
Sindhwani et al. (2005), and can be formed for the same general class of "intrinsic regularizers" considered therein
(see the details in Section 2). Our starting point is a given intrinsic regularizer, on functions $h \in \mathbb{R}^{\mathcal{X}}$, of the
form $\mathrm{reg}_Q(h) := \mathbf{h}^T Q \mathbf{h}$, defined on the measurements $\mathbf{h} := (h(x_i))_i \in \mathbb{R}^n$ of $h$ at a data sample
$X_S := \{x_1, \ldots, x_n\}$, where $Q$ is some symmetric positive semi-definite (often very sparse) matrix. Such regularizers are
used in the construction of data-dependent kernels on $\mathcal{X}$ (used, for example, for semi-supervised learning). Implicit
in this choice of regularizer is the assumption that the pseudoinverse $Q^+$ is a good covariance on the finite set
$X_S$ (see Theorem 3.1). Our proposed kernels are obtained by replacing the regularizer $\mathrm{reg}_Q(h)$ with an intrinsic
regularizer which measures each function $h$ at a small subsample $\hat{X}_S = \{x_{s_1}, \ldots, x_{s_{\hat{n}}}\} \subset X_S$, interpolates
the measurement $\hat{\mathbf{h}} := (h(x_{s_i}))_i \in \mathbb{R}^{\hat{n}}$ to a function $\mathbf{h}^* \in \mathbb{R}^n$ over $X_S$ using the covariance $Q^+$, and uses $\mathbf{h}^*$ to
approximate $\mathbf{h}$: the approximated intrinsic regularizer is thus $\mathrm{reg}_Q(\mathbf{h}^*)$. For very large $n$ we do not need to measure
a function $h$ at all $n$ sample points since $\mathrm{reg}_Q(\mathbf{h}^*)$ will be a good approximation to $\mathrm{reg}_Q(\mathbf{h})$ whenever $h$ is in some
class of sufficiently smooth functions in the sense specified by $\mathrm{reg}_Q$. We then form an r.k.h.s. of functions over the
input space, whose norm includes our reduced intrinsic regularizer as a component.

The surprising and useful result we prove is that while the complexity of computing data-dependent kernels is 
cubic in the number of points at which functions are measured it is only linear in the number of non-zero entries of 
Q, which in typical cases leads to nearly-linear complexity in n. Thus by disconnecting the number of points used 
to build the regularization matrix Q (typically all n data points) and the number of points at which functions are 
measured we are able to practically achieve genuinely large-scale semi-supervised learning. For example when Q 
is a graph Laplacian our method allows us to use a huge amount of data to build the graph and define the intrinsic 
regularizer, obtaining a data-dependent kernel on the input space X in nearly-linear time. This is important since 
graph building is often not robust at small sample sizes: experimentally we demonstrate a significant advantage can 
be gained from the ability to exploit a much larger quantity of unlabelled data. 

Used with the SVM, our kernels can be informally viewed as providing an efficient approximation of LapSVM,
and exhibit comparable performance on small datasets. On larger datasets, where LapSVM is infeasible, the
method comfortably outperforms the RBF kernel and a more naive implementation of a "budget" LapSVM (defined
by discarding the majority of unlabeled data). We also provide an application to clustering.



1.1 Preliminaries 

We consider the design of kernels suitable for the (semi-)supervised learning problem in which we must infer a
regression or classification function $h : \mathcal{X} \to \mathcal{Y}$ mapping instances $x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$. In particular we
suppose there exists a distribution $\mu$ over the set $\mathcal{X} \times \mathcal{Y}$ of labeled instances and that we have a partially labeled
sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \cup \{x_{m+1}, \ldots, x_n\}$ drawn from the product distribution $\mu^{n,m} := \mu^m \times \mu_{\mathcal{X}}^{n-m}$,
where $\mu_{\mathcal{X}}$ is the marginal distribution over the instance space $\mathcal{X}$. We denote $X_S := \{x_i : (x_i, y_i) \in S \text{ or } x_i \in S\}$.

For a positive (semi-)definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ we denote by $\mathcal{H}_K = \overline{\mathrm{span}}\{K(x, \cdot) : x \in \mathcal{X}\}$ (where
completion is w.r.t. the r.k.h.s. norm) its unique r.k.h.s.

For a matrix $M$ we denote by $M^+$ its Moore-Penrose pseudoinverse and by $\mathrm{im}(M)$ and $\mathrm{leftnull}(M)$ its
column and left null spaces. We denote by $I_n$ and $\mathbf{1}_n$ the $n \times n$ identity matrix and matrix of all ones respectively,
$\|M\|_\infty = \max_{ij} |M_{ij}|$, $\|M\|_2 = \max\{\sqrt{\lambda} : \lambda \text{ is an eigenvalue of } M^T M\}$ and $\kappa(M) := \|M\|_2 \|M^+\|_2$.
We denote the standard basis in $\mathbb{R}^n$ by $\{e_i\}$. When $M$ is symmetric and positive semi-definite we denote $\|z\|_M := z^T M z$.

We view the elements of the set $\mathbb{R}^V$ of real-valued functions on a finite set $V = \{v_1, \ldots, v_t\}$ as vectors $\mathbf{f} \in \mathbb{R}^t$
via $f(v_i) = f_i$.



2 Review of semi-supervised kernel methods 

We here recall a standard methodology to define a data-dependent kernel for semi-supervised learning in which the
norm of the associated r.k.h.s. captures the smoothness of each function w.r.t. the data sample. Given an arbitrary
kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with associated r.k.h.s. $\mathcal{H}_K$, Sindhwani et al. (2005) demonstrate that the space $\tilde{\mathcal{H}}_K$,
consisting of functions from $\mathcal{H}_K$, in which the inner product is modified,

    $\langle h, g \rangle_{\tilde{K}} := \langle h, g \rangle_K + \eta \langle u(h), u(g) \rangle_U, \quad h, g \in \mathcal{H}_K$    (1)

where $U$ is any linear space (to be chosen) with positive semi-definite inner product $\langle \cdot, \cdot \rangle_U$ and such that
$u : \mathcal{H}_K \to U$ is a bounded linear map, is an r.k.h.s. Typically the term $\langle u(h), u(h) \rangle_U$, in the expansion of $\|h\|^2_{\tilde{K}}$, acts as a
data-dependent "intrinsic regularizer" and captures a notion of smoothness of $h$ over the empirical sample. If we
define $\mathbf{h} := (h_i)_i \in \mathbb{R}^n$ as the vector of point evaluations of $h$ on the sample $S$, $h_i := h(x_i)$, where $x_i \in X_S$, then,
in particular, when

    $\langle u(h), u(g) \rangle_U = \mathbf{h}^T Q \mathbf{g}$,    (2)

for some symmetric p.s.d. regularizer matrix $Q$, the r.k.h.s. inner product (1) becomes

    $\langle h, g \rangle_{\tilde{K}} := \langle h, g \rangle_K + \eta \mathbf{h}^T Q \mathbf{g}, \quad h, g \in \mathcal{H}_K$.    (3)

For such a $Q$, and associated intrinsic regularization operator $\mathrm{reg}_Q : \mathbf{h} \mapsto \mathbf{h}^T Q \mathbf{h}$, we say that $h \in \mathcal{H}_K$ is
$\mathrm{reg}_Q$-smooth whenever $\mathrm{reg}_Q(\mathbf{h})$ is small. We have,

Theorem 2.1 (Sindhwani et al. (2005), Proposition 2.2). The r.k.h.s. $\tilde{\mathcal{H}}_K$ consisting of functions from $\mathcal{H}_K$ with
inner product (3) has reproducing kernel $\tilde{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by,

    $\tilde{K}(x, x') = K(x, x') - \eta \mathbf{k}_x^T (I + \eta Q \mathbf{K})^{-1} Q \mathbf{k}_{x'}$,    (4)

where $\mathbf{k}_x = (K(x_1, x), \ldots, K(x_n, x))^T$, and $\mathbf{K}$ is the $n \times n$ Gram matrix $\mathbf{K}_{ij} = K(x_i, x_j)$ for $i, j \le n$.



Kernels of the form (4) can be used in any kernel method as a means to achieve the semi-supervised goal of
exploiting unlabeled data. One common choice is constructed as follows: given a sample of labeled and unlabeled
points $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \cup \{x_{m+1}, \ldots, x_n\}$ drawn from the distribution $\mu^{n,m}$, consider the
intrinsic regularizer,

    $\langle u(h), u(h) \rangle_U := U_S(h, h) = \frac{1}{n(n-1)} \sum_{ij} (h(x_i) - h(x_j))^2 W(x_i, x_j)$,

where $W : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ captures similarity or "weight" between data points, for example $W(x, x')$ defined in terms of
$\|x - x'\|$ for some norm $\|\cdot\|$ over $\mathcal{X}$. Note that $U_S(h, g) = \frac{2}{n(n-1)} \mathbf{h}^T L \mathbf{g}$ where $L = D - W$ is a graph Laplacian
whose edge weights are controlled by $W = (W_{ij}) = (W(x_i, x_j))$ and $D_{ij} = \delta_{ij} \sum_k W_{ik}$. This smoothness
functional is a typical regularizer in semi-supervised learning (Zhu et al. 2003; Belkin et al. 2006, 2004) which
punishes functions which do not vary smoothly over the sample. Other choices for the intrinsic regularizer in (4)
include many derived from the Laplacian such as the normalised Laplacian $L_{\mathrm{norm}}$ or $L^p$, $\exp(L)$, $r(L)$ for some
real number $p$ or function $r$ (see e.g. Smola and Kondor 2003).
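The relation between the pairwise smoothness functional and the graph Laplacian can be checked numerically. The following is a minimal sketch only: the Gaussian weight function, the random data and the helper names `graph_laplacian` and `pairwise_smoothness` are illustrative assumptions, not part of the original construction.

```python
import numpy as np

def graph_laplacian(W):
    """Return L = D - W for a symmetric weight matrix W."""
    return np.diag(W.sum(axis=1)) - W

def pairwise_smoothness(h, W):
    """U_S(h, h) = 1/(n(n-1)) * sum_ij (h_i - h_j)^2 W_ij."""
    n = len(h)
    diffs = h[:, None] - h[None, :]
    return float((diffs ** 2 * W).sum() / (n * (n - 1)))

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists)            # assumed Gaussian weights (illustrative choice)
np.fill_diagonal(W, 0.0)

h = rng.normal(size=n)           # point evaluations of some function h
L = graph_laplacian(W)
lhs = pairwise_smoothness(h, W)
rhs = 2.0 / (n * (n - 1)) * float(h @ L @ h)
```

Under these assumptions `lhs` and `rhs` agree up to floating-point error, since summing over all ordered pairs counts each edge twice.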

The key drawback of the construction (4) (as was pointed out initially for the case of LapSVM in Belkin et al.
(2006)) is the $O(n^3)$ complexity required to invert the matrix $I + \eta Q \mathbf{K}$, which renders any derived method such as
LapSVM infeasible for even moderately large unlabeled samples. Even once $I + \eta Q \mathbf{K}$ is inverted, simply evaluating
the kernel $\tilde{K}$ at any pair $(x, x')$ requires an $O(n^2)$ computation.
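To make the cost concrete, the standard construction of Theorem 2.1 can be sketched with dense linear algebra. This is an illustrative sketch, not the authors' implementation: the RBF base kernel, the path-graph regularizer and the names `rbf` and `deformed_kernel` are assumptions made for the example.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def deformed_kernel(Xa, Xb, Xs, Q, eta, sigma=1.0):
    """K~(x, x') = K(x, x') - eta * k_x^T (I + eta Q K)^{-1} Q k_x'.

    Xs is the full sample (n points) and Q the n x n regularizer; the dense
    solve below is the O(n^3) step discussed in the text.
    """
    K = rbf(Xs, Xs, sigma)
    ka = rbf(Xs, Xa, sigma)            # column j is k_x for x = Xa[j]
    kb = rbf(Xs, Xb, sigma)
    M = np.eye(len(Xs)) + eta * Q @ K
    return rbf(Xa, Xb, sigma) - eta * ka.T @ np.linalg.solve(M, Q @ kb)

rng = np.random.default_rng(0)
n, eta = 20, 0.5
Xs = rng.normal(size=(n, 2))
# Assumed regularizer: Laplacian of a path graph over the sample order.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
Q = np.diag(A.sum(axis=1)) - A
G = deformed_kernel(Xs, Xs, Xs, Q, eta)   # Gram matrix of K~ on the sample
```

On the sample itself the Gram matrix of the deformed kernel reduces to $(I + \eta \mathbf{K} Q)^{-1} \mathbf{K}$, which gives a simple consistency check for such a sketch.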
3 A general method for efficiently constructing data-dependent kernels

Given a partially labeled sample $S$, we now detail the construction of efficiently computable data-dependent kernels.
Recalling the notation of Section 2 we thus suppose that a base kernel $K$ and intrinsic regularization operator
$\mathrm{reg}_Q : \mathbf{h} \mapsto \mathbf{h}^T Q \mathbf{h}$, where $Q$ is a p.s.d. regularization matrix, are given (we will consider typical special cases
later) and we are interested in constructing an r.k.h.s. $\hat{\mathcal{H}}_K$ consisting of functions in $\mathcal{H}_K$ whose inner product
achieves the regularization effect of (3) but for which, in contrast to (4), the reproducing kernel $\hat{K}$ is efficiently
computable. Approximating (3) subject to computational constraints appears difficult in general and we therefore
restrict our attention to a certain specific form of intrinsic inner products which we now describe. We denote a
subsample $\hat{X}_S = \{x_{s_1}, \ldots, x_{s_{\hat{n}}}\} \subset X_S$ with $|\hat{X}_S| =: \hat{n} \ll n$ and for a given $h \in \mathcal{H}_K$ denote its evaluation on $\hat{X}_S$
by $\hat{\mathbf{h}} := (\hat{h}_i)_i \in \mathbb{R}^{\hat{n}}$ where $\hat{h}_i := h(x_{s_i})$. We then consider those r.k.h.s. $\hat{\mathcal{H}}_K$ whose inner products are of the form,

    $\langle h, g \rangle_{\hat{K}} := \langle h, g \rangle_K + \eta \hat{\mathbf{h}}^T \hat{Q} \hat{\mathbf{g}}, \quad h, g \in \mathcal{H}_K$,    (5)

where the $\hat{n} \times \hat{n}$ symmetric p.s.d. matrix $\hat{Q}$ is to be chosen. Recalling Theorem 2.1 we see that the kernel $\hat{K}$ is
given by,

    $\hat{K}(x, x') = K(x, x') - \eta \hat{\mathbf{k}}_x^T (I_{\hat{n}} + \eta \hat{Q} \hat{\mathbf{K}})^{-1} \hat{Q} \hat{\mathbf{k}}_{x'}$,    (6)

where, for $x \in \mathcal{X}$, $\hat{\mathbf{k}}_x = (K(x_{s_1}, x), \ldots, K(x_{s_{\hat{n}}}, x))^T$, and $\hat{\mathbf{K}}$ is the $\hat{n} \times \hat{n}$ Gram matrix $\hat{\mathbf{K}}_{ij} = K(x_{s_i}, x_{s_j})$
for $i, j \le \hat{n}$. Given $\hat{Q}$, the complexity of computing (6) is $O(\hat{n}^3)$, thus whenever $\hat{Q}$ is efficiently computable the
complexity is substantially less than the $O(n^3)$ complexity of computing (4).

Suppose the subsample $\hat{X}_S$ is given (see footnote 1) and consider the choice of intrinsic regularization matrix $\hat{Q}$ and
associated operator $\mathrm{reg}_{\hat{Q}} : \hat{\mathbf{h}} \mapsto \hat{\mathbf{h}}^T \hat{Q} \hat{\mathbf{h}}$. The
most straightforward form of this approach would be, given $\hat{X}_S$, to discard all remaining data instances $X_S \setminus \hat{X}_S$
and construct $\hat{Q}$ using only the subsample $\hat{X}_S$; typically, for example, $\hat{Q}$ might be derived from the Laplacian
of a graph built on the subset $\hat{X}_S$. In discarding almost all unlabeled data no advantage can be gained from it, and
this simplistic method should act as a benchmark which any proposed method should improve upon. The task is to
choose an $\hat{n} \times \hat{n}$ matrix $\hat{Q}$ which achieves the effect of $Q$, exploiting all unlabeled data. It is perhaps surprising that
such a $\hat{Q}$ exists: for example, when $Q$ is an $n \times n$ Laplacian of a graph $\mathcal{G}$ constructed on all of $X_S$, we can find an
$\hat{n} \times \hat{n}$ regularization matrix $\hat{Q}$ whose associated regularization operator involves the full structure of the graph $\mathcal{G}$.

To motivate a natural choice for $\hat{Q}$ in (5) we first recall some well-known facts regarding the duality between
positive semi-definite regularization operators on spaces of functions and kernels on their domain defined by their
Green's functions (e.g. Smola et al. 1998). The following is a special case for finite input sets (the proof is given in
the Appendix):

Theorem 3.1. (e.g. Smola and Kondor 2003, Theorem 4) Given a finite set of points $V = \{v_1, \ldots, v_t\}$, consider
$h \in \mathbb{R}^V$ as a vector $\mathbf{h} \in \mathbb{R}^t$ via $h(v_i) := \mathbf{h}^T e_i = h_i$. Consider further a regularization operator on such functions,
$\mathrm{reg} : \mathbb{R}^V \to \mathbb{R}$ given by $\mathrm{reg}(h) = \mathbf{h}^T R \mathbf{h}$, where $R$ is a symmetric positive semi-definite matrix. Then the Hilbert
space $\mathcal{H} = \mathrm{im}(R) \subseteq \mathbb{R}^t$ of real-valued functions on $V$ with inner product $\langle h, g \rangle_{\mathcal{H}} = \mathbf{h}^T R \mathbf{g}$ is an r.k.h.s. whose
reproducing kernel $K : V \times V \to \mathbb{R}$ is given by $R^+$, i.e. such that $K(v_i, v_j) := R^+_{ij} = e_i^T R^+ e_j$.

The Green's function in this case is simply the matrix pseudoinverse of the regularization operator $R$. Thus
sensible regularization operators on functions over finite sets define sensible reproducing kernels via their
pseudoinverse. A natural choice for the regularization operator $\hat{Q}$ is immediately motivated by the above observations:
for the given intrinsic regularizer $Q$ we can view $Q^+$ as a kernel on $X_S$ via $Q^+(x_i, x_j) = Q^+_{ij}$. In particular, the
submatrix $Q^+|_{\hat{X}_S} = (Q^+_{s_i s_j} : i, j \le \hat{n})$, being the Gram matrix of the restriction of this kernel to $\hat{X}_S$, is always
a valid positive semi-definite kernel on $\hat{X}_S$ which captures precisely the affinities on $\hat{X}_S$ induced by $Q^+$. Thus,
recalling the duality of Theorem 3.1, our proposed choice of regularizer $\hat{Q}$ is,

    $\hat{Q} = (Q^+|_{\hat{X}_S})^+$,    (7)

i.e. $\hat{Q}^+_{ij} = Q^+_{s_i s_j}$, and its associated operator $\mathrm{reg}_{\hat{Q}} : \hat{\mathbf{h}} \mapsto \hat{\mathbf{h}}^T \hat{Q} \hat{\mathbf{h}}$ is a natural regularizer for functions on $\hat{X}_S$.

Since $\hat{Q}^+$ given by (7) is the submatrix of the pseudoinverse of the $n \times n$ matrix $Q$ it might seem that $\hat{Q}$ is not
efficiently computable, but in Section 4 we will see that for typical choices of $Q$, an $\epsilon$-approximation to the kernel
$\hat{Q}^+$ is computable in nearly-linear time.

1 A random subsample of $X_S$ seems sensible as it would ensure that $\hat{X}_S$ is an i.i.d. sample from the underlying
data-generating distribution.
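The construction of the reduced regularizer in (7) and of the resulting kernel (6) can be sketched as follows. This is a naive dense illustration under stated assumptions (an RBF base kernel, a small ridge added to a Laplacian, and the hypothetical helper names `reduced_regularizer` and `efficient_kernel`); it uses an explicit pseudoinverse, so it does not reflect the nearly-linear computation discussed in Section 4.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def reduced_regularizer(Q, idx):
    """Q_hat = (Q^+ restricted to the subsample)^+, computed densely."""
    Q_pinv = np.linalg.pinv(Q)
    return np.linalg.pinv(Q_pinv[np.ix_(idx, idx)])

def efficient_kernel(Xa, Xb, X_sub, Q_hat, eta, sigma=1.0):
    """K^(x, x') = K(x, x') - eta * k_x^T (I + eta Q_hat K_hat)^{-1} Q_hat k_x'."""
    K_hat = rbf(X_sub, X_sub, sigma)
    ka = rbf(X_sub, Xa, sigma)
    kb = rbf(X_sub, Xb, sigma)
    M = np.eye(len(X_sub)) + eta * Q_hat @ K_hat
    return rbf(Xa, Xb, sigma) - eta * ka.T @ np.linalg.solve(M, Q_hat @ kb)

rng = np.random.default_rng(0)
n, n_hat = 30, 5
X = rng.normal(size=(n, 2))
W = rbf(X, X)
np.fill_diagonal(W, 0.0)
Q = np.diag(W.sum(axis=1)) - W + 1e-3 * np.eye(n)   # Laplacian plus small ridge (assumed)
idx = np.arange(n_hat)                              # subsample indices s_1..s_n_hat
Q_hat = reduced_regularizer(Q, idx)
G_hat = efficient_kernel(X, X, X[idx], Q_hat, eta=0.5)
```

Only an $\hat{n} \times \hat{n}$ system is solved when evaluating the kernel, while `Q_hat` still depends on the graph built on all $n$ points.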

3.1 Interpreting the intrinsic regularizer 

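So far we have motivated our choice of regularization matrix $\hat{Q} = (Q^+|_{\hat{X}_S})^+$ in (6) by demonstrating that its
pseudoinverse $\hat{Q}^+$ is a natural kernel on the subset $\hat{X}_S$, and invoking Theorem 3.1. We now give the key result
interpreting the intrinsic inner product in terms of the interpolation of functions defined on $\hat{X}_S$ to $X_S$.

We first introduce some notation: note that we can reorder the set $X_S = \{x_1, \ldots, x_n\}$ such that w.l.o.g. we can
consider $\hat{X}_S = \{x_1, \ldots, x_{\hat{n}}\}$. We then write $\bar{X}_S = X_S \setminus \hat{X}_S$, $\bar{n} := |\bar{X}_S| = n - \hat{n}$ and

    $Q = \begin{pmatrix} Q_{\hat{n}\hat{n}} & Q_{\hat{n}\bar{n}} \\ Q_{\bar{n}\hat{n}} & Q_{\bar{n}\bar{n}} \end{pmatrix}$,

where $Q_{\hat{n}\bar{n}} = (Q_{ij} : x_i \in \hat{X}_S, x_j \in \bar{X}_S)$ and $Q_{\hat{n}\hat{n}}$, $Q_{\bar{n}\bar{n}}$ etc. are defined analogously. For a function $\hat{\mathbf{h}} \in \mathbb{R}^{\hat{X}_S}$
define $\mathrm{int}_Q(\hat{\mathbf{h}}) := \mathrm{argmin}_{f \in \mathbb{R}^{X_S}} \{\mathrm{reg}_Q(f) : f|_{\hat{X}_S} = \hat{\mathbf{h}}\}$, the minimum (semi-)norm interpolants of $\hat{\mathbf{h}}$. We have:

Theorem 3.2. Suppose $Q$ is such that the generalized Schur complement $Q_{\hat{n}\hat{n}} - Q_{\hat{n}\bar{n}} Q^+_{\bar{n}\bar{n}} Q_{\bar{n}\hat{n}}$ is nonsingular (see footnote 2);
then, for any $\hat{\mathbf{h}}, \hat{\mathbf{g}} \in \mathbb{R}^{\hat{X}_S}$, the intrinsic inner product in (5) with $\hat{Q}$ defined by (7) satisfies,

    $\hat{\mathbf{h}}^T \hat{Q} \hat{\mathbf{g}} = (\mathbf{h}^*)^T Q \mathbf{g}^*$,

where $\mathbf{h}^*$ and $\mathbf{g}^*$ are any elements of the sets of interpolants $\mathrm{int}_Q(\hat{\mathbf{h}})$ and $\mathrm{int}_Q(\hat{\mathbf{g}})$.

Proof. Consider some $\hat{\mathbf{h}}, \hat{\mathbf{g}} \in \mathbb{R}^{\hat{X}_S}$ and $\mathbf{h}^* \in \mathrm{int}_Q(\hat{\mathbf{h}})$, $\mathbf{g}^* \in \mathrm{int}_Q(\hat{\mathbf{g}})$. For any $f \in \mathbb{R}^{X_S}$ note that,

    $\mathrm{reg}_Q(f) = \begin{pmatrix} f|_{\hat{X}_S} \\ f|_{\bar{X}_S} \end{pmatrix}^T \begin{pmatrix} Q_{\hat{n}\hat{n}} & Q_{\hat{n}\bar{n}} \\ Q_{\bar{n}\hat{n}} & Q_{\bar{n}\bar{n}} \end{pmatrix} \begin{pmatrix} f|_{\hat{X}_S} \\ f|_{\bar{X}_S} \end{pmatrix}$,    (8)

and differentiating (8) w.r.t. $f|_{\bar{X}_S}$, setting this to zero when $f = \mathbf{h}^*$ and setting $\mathbf{h}^*|_{\hat{X}_S} = \hat{\mathbf{h}}$ we obtain,

    $Q_{\bar{n}\bar{n}} \mathbf{h}^*|_{\bar{X}_S} = -Q_{\bar{n}\hat{n}} \hat{\mathbf{h}}$,    (9)

so that $\mathbf{h}^* = \begin{pmatrix} \hat{\mathbf{h}} \\ -Q^+_{\bar{n}\bar{n}} Q_{\bar{n}\hat{n}} \hat{\mathbf{h}} + u \end{pmatrix}$ for some $u \in \mathrm{leftnull}(Q_{\bar{n}\bar{n}})$, and we can similarly derive
$\mathbf{g}^* = \begin{pmatrix} \hat{\mathbf{g}} \\ -Q^+_{\bar{n}\bar{n}} Q_{\bar{n}\hat{n}} \hat{\mathbf{g}} + v \end{pmatrix}$ for some $v \in \mathrm{leftnull}(Q_{\bar{n}\bar{n}})$. We obtain,

    $(\mathbf{h}^*)^T Q \mathbf{g}^* = \hat{\mathbf{h}}^T (Q_{\hat{n}\hat{n}} - Q_{\hat{n}\bar{n}} Q^+_{\bar{n}\bar{n}} Q_{\bar{n}\hat{n}}) \hat{\mathbf{g}} + u^T Q_{\bar{n}\hat{n}} \hat{\mathbf{g}} + v^T Q_{\bar{n}\hat{n}} \hat{\mathbf{h}}$
    $= \hat{\mathbf{h}}^T (Q^+|_{\hat{X}_S})^+ \hat{\mathbf{g}} = \hat{\mathbf{h}}^T \hat{Q} \hat{\mathbf{g}}$.

The final line follows from the formula for generalized inverses of partitioned matrices (see footnote 3 and e.g. Rohde 1965),
and since (9) implies that $Q_{\bar{n}\hat{n}} \hat{\mathbf{h}} \perp \mathrm{leftnull}(Q_{\bar{n}\bar{n}})$, and likewise $Q_{\bar{n}\hat{n}} \hat{\mathbf{g}} \perp \mathrm{leftnull}(Q_{\bar{n}\bar{n}})$, which removes the terms in $u$ and $v$. □

2 This is guaranteed, for example, when $Q$ is positive definite.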

In particular, the smoothness that $\mathrm{reg}_{\hat{Q}}(\hat{\mathbf{h}})$ measures is therefore the $\mathrm{reg}_Q$-smoothness of any minimum
(semi-)norm interpolant $\mathbf{h}^*$ of $\hat{\mathbf{h}}$. There is also a Bayesian interpretation: the $\mathrm{reg}_{\hat{Q}}$-smoothness of $\hat{\mathbf{h}}$ is the
$\mathrm{reg}_Q$-smoothness of the posterior mean of Bayesian inference in the GP using covariance $Q^+$ with observations $\hat{\mathbf{h}}$
sampled at $\hat{X}_S$ in the limit of no noise; there is a well-known equivalence with the minimum semi-norm interpolant.
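The interpolation identity of Theorem 3.2 and its Gaussian process reading can be verified numerically in the positive definite case. The sketch below assumes a randomly generated positive definite $Q$ and takes the subsample to be the first few indices; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_hat = 8, 3
B = rng.normal(size=(n, n))
Q = B @ B.T + 0.1 * np.eye(n)           # assumed positive definite regularizer

Q_hb = Q[:n_hat, n_hat:]                # block coupling subsample and remainder
Q_bb = Q[n_hat:, n_hat:]                # block on the remaining points

h_hat = rng.normal(size=n_hat)          # measurements on the subsample
# Minimum-norm interpolant: solve Q_bb h_bar = -Q_bh h_hat, cf. equation (9).
h_bar = -np.linalg.solve(Q_bb, Q_hb.T @ h_hat)
h_star = np.concatenate([h_hat, h_bar])

C = np.linalg.inv(Q)                    # covariance Q^+ (= Q^{-1} here)
Q_hat = np.linalg.inv(C[:n_hat, :n_hat])

reduced_value = float(h_hat @ Q_hat @ h_hat)
full_value = float(h_star @ Q @ h_star)

# Noise-free GP posterior mean on the remaining points under covariance C.
gp_mean = C[n_hat:, :n_hat] @ np.linalg.solve(C[:n_hat, :n_hat], h_hat)
```

Here `reduced_value` and `full_value` coincide, and `gp_mean` matches `h_bar`, illustrating the equivalence between the interpolant and the posterior mean stated above.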

3.1.1 Specialization to graph Laplacian-based regularizers

Via Theorem 3.2 we see that $\mathrm{reg}_{\hat{Q}}$ takes into account the whole of the data sample (whenever $Q$ does). We now
expand upon this in the common case when $Q$ is (derived from) a graph Laplacian. Given a graph $\mathcal{G} = (V, E)$
constructed on $X_S$ (i.e. there is a bijection $X_S \to V$), suppose that $\mathrm{reg}_Q$ measures smoothness of functions over the
vertices $V$ w.r.t. the graph structure (as explained in Section 2), the typical example being when $Q$ is a Laplacian.
The $\hat{Q}$-smoothness $\mathrm{reg}_{\hat{Q}}(\hat{\mathbf{h}}) = \hat{\mathbf{h}}^T \hat{Q} \hat{\mathbf{h}}$ of $\hat{\mathbf{h}} \in \mathbb{R}^{\hat{X}_S}$ is then small whenever $\hat{\mathbf{h}}$ admits an extension to the full vertex
set $V$ which respects the structure of $\mathcal{G}$, as illustrated in Figure 1.

4 Complexity analysis 

We now show that for typical choices of intrinsic regularization matrices $Q$ (an approximation to) our kernel $\hat{K}$ is
efficiently computable. If $\hat{Q}$ is computed then there is a one-time $O(\hat{n}^3)$ cost to construct $(I_{\hat{n}} + \eta \hat{Q} \hat{\mathbf{K}})^{-1}$, following
which kernel evaluations can be computed in $O(\hat{n}^2)$ time. Therefore it remains to establish the complexity
of computing $\hat{Q}$.

We first consider the case when $Q$ is a symmetric, diagonally dominant sparse matrix with $s$ non-zero entries,
and suppose for simplicity that $s \ge n$. In this case we show that there is an algorithm with complexity
$O(\hat{n} s \log n (\log\log n)^2 \log\frac{1}{\epsilon} + \hat{n}^2 n)$ which returns an $\epsilon$-approximation $A$ to the kernel matrix $\hat{Q}^+$. We need the
following lemma, which is a recent example of nearly-linear time solvers for sparse symmetric diagonally dominant
linear systems pioneered by Spielman and Teng (2006).

Lemma 4.1. (Koutis et al. 2011; see footnote 4) Given a symmetric diagonally dominant $n \times n$ matrix $M$ with $s$ non-zero entries
and a vector $b \in \mathbb{R}^n$ there exists an algorithm which in expected time $O(s \log n (\log\log n)^2 \log\frac{1}{\epsilon})$ computes $z \in \mathbb{R}^n$
satisfying $\|z - M^+ b\|_M \le \epsilon \|M^+ b\|_M$.

We can now prove the following:

3 Which, when $Q$ is positive definite, reduces to the well-known formula.

4 Published papers with similar guarantees are Koutis et al. 2010; Spielman and Teng 2006.
(a) non-$\hat{Q}$-smooth labeling    (b) $\hat{Q}$-smooth labeling    (c) Graph on reduced sample

Figure 1: Artificial illustration: concentric circles. A 2-NN graph built on the large data sample (black spots
connected by edges) captures the underlying structure of the two concentric circles defining the two classes. Suppose $Q$
captures smoothness on this graph. The subsample $\hat{X}_S$ is highlighted red and green. The hypothesis of Figure 1(a)
(the separating hyperplane is shown by the dotted line) is non-$\hat{Q}$-smooth: there is no $Q$-smooth extension of the
labeling of $\hat{X}_S$ to the full graph. The hypothesis of Figure 1(b), separating the two classes, is $\hat{Q}$-smooth: there
exists a $Q$-smooth extension of the labeling of $\hat{X}_S$ to the full graph. Figure 1(c): a 2-NN graph built on $\hat{X}_S$ does
not capture the structure of the data distribution and the correct labeling is not smooth w.r.t. this graph.



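Theorem 4.2. Given a symmetric diagonally dominant $n \times n$ matrix $Q$ with $s$ non-zero entries, an approximation
$A$ to the kernel matrix $\hat{Q}^+$ on $\hat{X}_S$ can be computed in expected time $O(\hat{n} s \log n (\log\log n)^2 \log\frac{1}{\epsilon} + \hat{n}^2 n)$ where,

    $|A_{ij} - \hat{Q}^+_{ij}| \le \epsilon\, Q^+_{s_i s_i} Q^+_{s_j s_j}$,    (10)

and thus in sup norm,

    $\|A - \hat{Q}^+\|_\infty \le \epsilon \sup_{\{i : x_i \in \hat{X}_S\}} (Q^+_{ii})^2$,    (11)

and in spectral norm,

    $\|A - \hat{Q}^+\|_2 \le \epsilon \sum_{\{i : x_i \in \hat{X}_S\}} (Q^+_{ii})^2$.    (12)

Proof. We begin by making $\hat{n}$ calls to the solver of Koutis et al. (2011) to solve the equations $Q z = e_{s_i}$
for each $i$ where $x_{s_i} \in \hat{X}_S$, giving $z_i$ such that $\|z_i - Q^+ e_{s_i}\|_Q \le \epsilon \|Q^+ e_{s_i}\|_Q = \epsilon Q^+_{s_i s_i}$ in total time
$O(\hat{n} s \log n (\log\log n)^2 \log\frac{1}{\epsilon})$ by Lemma 4.1. Now let $Z := (z_1 \ \ldots \ z_{\hat{n}})$ and

    $A := Z^T Q Z$,

and note that $A$ can be computed with $O(s\hat{n} + \hat{n}^2 n)$ operations since $Q$ has $s$ non-zero entries. Now note that

    $|\hat{Q}^+_{ij} - A_{ij}| = |Q^+_{s_i s_j} - A_{ij}| = |e_{s_i}^T Q^+ e_{s_j} - z_i^T Q z_j|$
    $= |(Q^+ e_{s_i} - z_i)^T Q Q^+ e_{s_j} + (Q^+ e_{s_j} - z_j)^T Q Q^+ e_{s_i} - (Q^+ e_{s_i} - z_i)^T Q (Q^+ e_{s_j} - z_j)|$
    $\le \|Q^+ e_{s_i} - z_i\|_Q Q^+_{s_j s_j} + \|Q^+ e_{s_j} - z_j\|_Q Q^+_{s_i s_i} + \|Q^+ e_{s_i} - z_i\|_Q \|Q^+ e_{s_j} - z_j\|_Q$
    $\le (2\epsilon + \epsilon^2) Q^+_{s_i s_i} Q^+_{s_j s_j}$,

which (after rescaling $\epsilon' = 2\epsilon + \epsilon^2$) proves (10). Now note that,

    $\|A - \hat{Q}^+\|_2 = \sup_{\{x : \|x\| \le 1\}} |x^T (A - \hat{Q}^+) x| = \sup_{\{x : \|x\| \le 1\}} \Big| \sum_{i,j \le \hat{n}} x_i x_j (A_{ij} - \hat{Q}^+_{ij}) \Big|$
    $\le \epsilon \sup_{\{x : \|x\| \le 1\}} \Big( \sum_{i \le \hat{n}} |x_i| Q^+_{s_i s_i} \Big) \Big( \sum_{j \le \hat{n}} |x_j| Q^+_{s_j s_j} \Big) \le \epsilon \sum_{i \le \hat{n}} (Q^+_{s_i s_i})^2$,

which proves (12). □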
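The procedure in the proof (one linear solve per subsample point, followed by forming $A = Z^T Q Z$) can be sketched with an off-the-shelf sparse solver. This is only an illustration: it uses SciPy's unpreconditioned conjugate gradient routine rather than the solver of Koutis et al. (2011), so it carries none of the stated complexity guarantees, and the tridiagonal test matrix and the name `approx_reduced_kernel` are assumptions.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def approx_reduced_kernel(Q, idx):
    """Approximate the submatrix of Q^{-1} on idx via one CG solve per index.

    Each solve targets Q z = e_s; the columns z are stacked into Z and the
    approximation is A = Z^T Q Z, as in the proof above.
    """
    n = Q.shape[0]
    columns = []
    for s in idx:
        e = np.zeros(n)
        e[s] = 1.0
        z, info = cg(Q, e, maxiter=10 * n)
        if info != 0:
            raise RuntimeError(f"CG did not converge for index {s}")
        columns.append(z)
    Z = np.column_stack(columns)        # n x n_hat
    return Z.T @ (Q @ Z)                # n_hat x n_hat

n = 200
# Assumed test matrix: sparse, symmetric, strictly diagonally dominant.
Q = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
idx = [3, 50, 120, 199]
A = approx_reduced_kernel(Q, idx)
```

Each solve only touches the non-zero entries of $Q$ through matrix-vector products, which is where the sparsity of the regularizer is exploited.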



The linear solvers used to compute the regularizer $\hat{Q}$ utilise low-stretch spanning tree preconditioners. In
practice we use a recent practical implementation (Koutis 2011) of these ideas (though not precisely the algorithm
attaining the guarantee above) and achieve linear-time scaling in practice. We can also derive a similar result for
an approximation of the kernel $\hat{K}$, and we essentially incur an additional logarithmic dependence upon $\lambda_{\min}$, where
$\lambda_{\min} := \min\{\lambda : \lambda \text{ is an eigenvalue of } \hat{Q}^{-1} + \eta \hat{\mathbf{K}}\}$. The following theorem is proved in the appendix:

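Theorem 4.3. Given a symmetric diagonally dominant $n \times n$ matrix $Q$ with $s$ non-zero entries, let $\hat{Q}$ be as defined
in (7) and suppose further that $\hat{Q}$ is positive definite. Let $\lambda_{\min} := \min\{\lambda : \lambda \text{ is an eigenvalue of } \hat{Q}^{-1} + \eta \hat{\mathbf{K}}\}$. If
$\epsilon \ll \lambda_{\min}$, then an approximation $\hat{K}_A$ to the kernel $\hat{K}$ defined by (6), where $K$ satisfies $\sup_{x \in \mathcal{X}} K(x, x) = \kappa < \infty$,
can be computed in expected time $O(\hat{n} s \log n (\log\log n)^2 \log\frac{q \eta \hat{n} \kappa}{\epsilon \lambda_{\min}^2} + \hat{n}^2 n)$, where $q := \sum_{\{i : x_i \in \hat{X}_S\}} (Q^+_{ii})^2$,
such that

    $\sup_{x, x' \in \mathcal{X}} |\hat{K}(x, x') - \hat{K}_A(x, x')| \le \epsilon + \text{h.o.t.}$,

where h.o.t. denotes smaller terms in $\epsilon^2$ or greater.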



4.1 Laplacians, higher order regularizers and amplified resistances 



In the case when $Q = L$, a sparse (connected) graph Laplacian, as is typically the case in semi-supervised learning
applications, Theorem 4.2 demonstrates that we can approximate the kernel $\hat{Q}^+$ well. By applying simple
transforms to the linear systems solved in the proof of Theorem 4.2, very similar results will hold for the normalized
Laplacian $L_{\mathrm{norm}} = D^{-1/2} L D^{-1/2}$ or other regularizers obtained from simple transforms. Recent theoretical and
practical results (Nadler et al. 2009; Zhou and Belkin 2011; von Luxburg et al. 2010) demonstrate some problems
with using $L$ as a regularizer: for example, the solution of Laplacian regularised empirical risk minimization
degenerates to a constant function with spikes at labeled points in the limit of large data whenever the intrinsic
dimensionality of the data manifold is small. A solution suggested by the analysis of Zhou and Belkin (2011) is to include
iterated Laplacians $L^p$ as regularizers. It is important that our scheme applies to these more general regularizers,
which may not be sparse. In the case of iterated regularizers $Q = R^p$, our method incurs an additional quadratic
dependence on $p$ and a logarithmic dependence on the generalized condition number $\kappa(R) = \|R\|_2 \|R^+\|_2$ of $R$;
the following is proved in the appendix:

Theorem 4.4. Given an $n \times n$ intrinsic regularization matrix $Q = R^p$ where $R$ is symmetric, diagonally dominant
and has $s$ non-zero entries, an approximation $A$ to the kernel matrix $\hat{Q}^+$ on $\hat{X}_S$ can be computed in expected time
$O(p \hat{n} s \log n (\log\log n)^2 (\log\frac{1}{\epsilon} + p \log \kappa(R)) + \hat{n}^2 n)$, and such that,

    $\|A - \hat{Q}^+\| \le \epsilon + \text{h.o.t.}$,

where h.o.t. denotes terms involving order $\epsilon^2$ or higher.



We also remark that kernels associated to the amplified resistances of von Luxburg et al. (2010) can also be
efficiently approximated.

We have seen that our method will enable the construction of efficient data-dependent kernels based upon a
variety of recent approaches for graph-based regularization. We should finally mention that, when the graph is not
given, forming a k-nearest neighbor graph, for example, can be achieved in $O(n \log n)$ (Vaidya 1989) on
low-dimensional data, and approximations exist for high-dimensional data (Chen et al. 2009), so there is no other
computational bottleneck in this approach.



5 Experiments 

5.1 Semi-supervised binary classification 

We experiment on standard binary classification tasks. The first of our experiments compares the efficient
semi-supervised kernels with LapSVM and a Gaussian RBF kernel SVM on the 'letter' data set from the UCI repository
(Frank and Asuncion 2010) and the 'MNIST digits' data (Lecun and Cortes). We build k-NN graphs with k = 5
and 0-1 weights and form powers of the normalized Laplacian $L^p_{\mathrm{norm}}$ (with a small ridge term) as the basic intrinsic
regularizer matrix $Q$. This was mixed with an ambient Gaussian RBF kernel $K_\sigma$, with bandwidth $\sigma$, as in (4) to
form the kernel for LapSVM. The subsample was chosen at random, except for a strong bias to the labeled data (see footnote 5).
In our experiments we use the preconditioned conjugate gradient solver, with the preconditioner of Koutis (2011),
which uses a combination of combinatorial preconditioners and multigrid methods (see footnote 6), to solve the linear systems
required to obtain $\hat{Q}$. We then form the efficient data-dependent kernel as in (6). Model selection was performed
using 5-fold cross validation over a grid of values for the exponent $p$, the level of intrinsic regularization $\eta$ and the
bandwidth $\sigma$ of the Gaussian kernel ($\sigma$ could alternatively be chosen using a common heuristic). The subsample
is formed using $\hat{n} = 250$ points of the labeled and unlabeled data chosen uniformly at random. All results are
averaged over 50 trials. In Figure 2 we give learning curves for the three methods: the x-axis is the size of the
labelled set. The efficient kernel recovers the performance of the full LapSVM.

In the second set of experiments (Figure 3) our set-up is as above but we consider larger datasets, the full 'MNIST
digits' data, on which implementing the full LapSVM is infeasible. We consider the '4 vs 9' and '3 vs 8' tasks on
12'000 labelled and unlabelled data points and the 'Odd vs Even' task on 64'000 data points (on which results are
averaged over 25 trials). We consider small subsamples of size $\hat{n} = 250$ and $\hat{n} = 500$. We compare to the Gaussian
RBF kernel and a simplistic implementation of "budget" LapSVM building a graph on the reduced subsample $\hat{X}_S$
only, as a sanity check to ensure the method outperforms this benchmark; the point here is that in practice one
would work under computational budget constraints, and the natural choice would be between discarding most of
the data and implementing the full LapSVM on the reduced sample, or exploiting all data with the efficient kernel
measuring functions at the subsample, since they have (roughly) the same complexity in the subsample size. The
efficient LapSVM substantially outperforms both the "budget" LapSVM approach and the Gaussian RBF SVM,
learning much faster, in particular with a very small labelled sample. A significant advantage can be
gained from the efficient method's ability to exploit all 64'000 unlabelled points in the 'Odd vs Even' task.

5 It seems important to ensure that the labeled data does not exclusively contain points from the subsample - essentially so that cross validation
is performed over some points not in the domain of the intrinsic kernel, so that the algorithm does not learn the transductive problem, but a precise
ratio seems unimportant.

6 This is a practical solver using combinatorial preconditioners, though not the implementation achieving nearly-linear theoretical performance.



5.2 Clustering 

Another application of the efficiently computable data-dependent kernel is to clustering. We consider 2-class
clustering on an artificially generated 2-moons data set with 1000 data points. For a kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ we define
a metric $d$ on $\mathcal{X}$ via $d(x, x') = \|K(x, \cdot) - K(x', \cdot)\|_K = \sqrt{K(x, x) + K(x', x') - 2K(x, x')}$. We investigate
k-means clustering (k = 2) comparing the full LapSVM kernel, the efficient data-dependent kernel (generated as
outlined in Section 5.1 with p = 2) and Euclidean distance. The efficient kernel uses a subsample $\hat{X}_S$ of size
$\hat{n} = 40$ to measure functions, whereas the full LapSVM kernel uses all 1000 data points. We selected the best
kernels from a small grid over the parameters $\gamma$ and $\eta$. The results are displayed in Figure 4: the Euclidean distances
incurred an error of 11.4%, the full LapSVM kernel achieved perfect clustering with 0% error and the efficient kernel
achieved 1% error. Thus using a subsample of just 4% of the data, we are able to almost recover, in nearly-linear
rather than cubic time, the performance of the full LapSVM kernel.
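The kernel-induced metric used for clustering can be computed directly from a Gram matrix. The following is a minimal sketch; the function name `kernel_distance_matrix` and the linear-kernel sanity check are illustrative assumptions rather than part of the experiments.

```python
import numpy as np

def kernel_distance_matrix(G):
    """Pairwise d(x_i, x_j) = sqrt(G_ii + G_jj - 2 G_ij) from a Gram matrix G."""
    diag = np.diag(G)
    sq = diag[:, None] + diag[None, :] - 2.0 * G
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
G_linear = X @ X.T                      # linear kernel: metric is Euclidean
D = kernel_distance_matrix(G_linear)
```

With a linear kernel the induced metric reduces to the Euclidean distance, which gives a quick check before substituting the Gram matrix of a data-dependent kernel.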



Dataset              sample size n (labeled + unlabeled)   subsample size $\hat{n} = |\hat{X}_S|$   test set size
letter D vs O        1250                                  250                                      308
letter O vs Q        1250                                  250                                      286
MNIST 2 vs 3         2000                                  250                                      405
MNIST 3 vs 8         12'000                                250                                      1966
MNIST 4 vs 9         12'000                                250                                      1782
MNIST Odd vs Even    64'000                                500                                      500

Table 1: Binary classification experiments



5.3 Practical timing results 

To validate the practical timing performance of the proposed method we consider the time taken to compute the
semi-supervised kernels on the MNIST data, as detailed in Section 5.1 but using a non-normalised Laplacian and
$p = 1$, $\gamma = 1$ and $\eta = 1$. We consider the computation time of the inverse of the $\hat{n} \times \hat{n}$ matrix $(I_{\hat{n}} + \eta \hat{Q} \hat{\mathbf{K}})$,
including the computation of the matrix $\hat{Q}$ from $Q$, which is the heart of the efficient kernel computation and is
theoretically nearly-linear. We compare 2 methods of solving the linear systems required to compute $\hat{Q}$: the
preconditioned conjugate gradient solver, with the combinatorial preconditioner of Koutis (2011) used in the
experiments; and the Matlab "backslash" operator. We compare these results to the computation of the inverse of the
(non-sparse) $n \times n$ matrix $(I_n + \eta Q \mathbf{K})^{-1}$, which is the computational bottleneck of the standard semi-supervised
kernel construction and is cubic in complexity.

Results are shown in Figure 5. In practice the method is extremely fast: the efficiently computable kernels can
be computed on 64'000 MNIST data points in 3 minutes (and the computation remains feasible on much larger data
still). The preconditioned conjugate gradient method achieves approximately linear complexity in our experiments.
The backslash method is also very fast on small data sizes (presumably due to the vectorization of the Matlab
implementation) but appears to grow super-linearly on this data set. As expected, the computation time of the
standard semi-supervised kernel construction becomes infeasible for just a few thousand data points.
Figure 2: Classification: small data sets

Figure 3: Classification: large data sets

[Figure 4 panels: (a) Euclidean distance, "Kmeans on Euclidean distance matrix"; (b) Full LapSVM kernel, "Kmeans using SSL kernel (1000 basis points)"; (c) Efficient kernel, "Kmeans using Efficient SSL kernel (50 basis points)"]

Figure 4: Clustering: 2-moons data, 1000 data points

6 Conclusions 

We have presented a method for generating data-dependent kernels in nearly-linear time. The method is based on 
decoupling the number of data points used to build a data-dependent regularization matrix from the number of 
points at which functions are measured. By measuring at fewer points and (implicitly) interpolating, our method is 
able to exploit huge amounts of unlabelled data in semi-supervised and unsupervised learning tasks. 

Encouragingly, our experiments show that a significant advantage can be gained in semi-supervised learning 
from the ability to exploit a much greater quantity of unlabelled data: on large data sets of 64'000 data points the 
advantage gained from exploiting the large quantity of unlabelled data is clear, and much greater than the 
improvement demonstrated when only a small quantity of unlabelled data can be exploited. In a clustering experiment 
the method approximately recovers the performance of the full kernel by measuring functions at a small fraction of 
the data points. 
A Proofs 



A.1 Proof of TheoremTTl 

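Proof. We just need to check the reproducing property of $K = R^+$: for all $v_i \in V$ and $h \in \mathrm{im}(R)$, 

$$\langle h, K(v_i, \cdot) \rangle_{\mathcal{H}} = \langle h, R^+ e_i \rangle_{\mathcal{H}} = h^\top R R^+ e_i = h^\top e_i = h_i = h(v_i),$$ 

where the third equality holds because $R R^+$ is the orthogonal projection onto $\mathrm{im}(R)$ and $h \in \mathrm{im}(R)$. □ 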
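As a small worked instance of this reproducing property, consider the unnormalised Laplacian of a single edge between two vertices. The choice of this two-vertex graph and the scalar $a$ are assumptions made purely for illustration.

```latex
R = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad
R^+ = \tfrac{1}{4}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad
R R^+ = \tfrac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.

\text{For } h = (a, -a)^\top \in \mathrm{im}(R): \qquad
h^\top R R^+ e_1 = \tfrac{1}{2}\,(a, -a)\begin{pmatrix} 1 \\ -1 \end{pmatrix} = a = h_1 .
```

Here $R R^+$ is not the identity, but it acts as the identity on $\mathrm{im}(R)$, which is all the argument requires.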



A.2 Proof of Theorem 4.3 

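Proof. Theorem 4.2 implies that in time $O(n_S s \log n (\log\log n)^2 \log\frac{\eta n_S \kappa}{\epsilon} + n_S^2 n)$ we can compute an $A$ such that, 

$$\|A - \tilde{Q}^{-1}\|_2 \le \frac{\epsilon}{\eta n_S \kappa}, \qquad (13)$$ 

which implies that (see for example Horn and Johnson (1990), section 5.8), 

$$\|(A + \eta K)^{-1} - (\tilde{Q}^{-1} + \eta K)^{-1}\|_2 \le \frac{\epsilon}{\eta n_S \kappa} + \text{h.o.t.}, \qquad (14)$$ 

where h.o.t. denotes terms of order $\epsilon^2$ or greater. Define $K_A(x,x') := K(x,x') + \eta k_x^\top (A + \eta K)^{-1} k_{x'}$. Then since 
$\sup_{x \in \mathcal{X}} \|k_x\|_2^2 = \kappa$ and since $\tilde{Q}$ is positive definite, 

$$\begin{aligned} 
\sup_{x,x' \in \mathcal{X}} |K_A(x,x') - \tilde{K}(x,x')| &= \eta \sup_{x,x' \in \mathcal{X}} \left| k_x^\top \left( (A + \eta K)^{-1} - (\tilde{Q}^{-1} + \eta K)^{-1} \right) k_{x'} \right| \\ 
&\le \eta \, \|(A + \eta K)^{-1} - (\tilde{Q}^{-1} + \eta K)^{-1}\|_2 \, \sup_{x \in \mathcal{X}} \|k_x\|_2^2 \\ 
&\le \epsilon + \text{h.o.t.} 
\end{aligned}$$ 

□ 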
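The passage from (13) to (14) rests on a standard first-order perturbation bound for matrix inverses. A sketch of that step is given below, using generic symbols $M$ (an invertible matrix) and $E$ (a small perturbation) that are introduced here for illustration and are not part of the paper's notation.

```latex
(M + E)^{-1} - M^{-1} = -M^{-1} E M^{-1} + O(\|E\|_2^2)
\quad\Longrightarrow\quad
\|(M + E)^{-1} - M^{-1}\|_2 \le \|M^{-1}\|_2^2 \, \|E\|_2 + \text{h.o.t.}
```

In the proof, $M$ plays the role of the exact matrix inverted in (14) and $E$ the error in $A$ controlled by (13), so an error of order $\epsilon$ in $A$ yields an error of order $\epsilon$ in the inverse, up to higher order terms.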

A.3 Proof of Theorem 4.4 



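Proof. We begin by making $p n_S$ calls to the solver of Koutis et al. (2011) to iteratively solve the equations 

$$R z_i^{(j)} = z_i^{(j-1)}$$ 

for each $i$ where $x_{s_i} \in X_S$ and all $1 \le j \le p$, where $z_i^{(0)} = e_{s_i}$. This gives $z_i^{(j)} = R^+ z_i^{(j-1)} + r_i^{(j)}$ such that 

$$\|r_i^{(j)}\|_R \le \epsilon \, \|R^+ z_i^{(j-1)}\|_R, \qquad (15)$$ 

in total time $O(p n_S s \log n (\log\log n)^2 \log\frac{1}{\epsilon})$ by Lemma 4.1. 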



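Now note that, 

$$\begin{aligned} 
z_i^{(j)} &= R^+ z_i^{(j-1)} + r_i^{(j)} \\ 
&= R^+ \left( R^+ z_i^{(j-2)} + r_i^{(j-1)} \right) + r_i^{(j)} \\ 
&= (R^+)^j z_i^{(0)} + (R^+)^{j-1} r_i^{(1)} + \dots + R^+ r_i^{(j-1)} + r_i^{(j)}. 
\end{aligned}$$ 

Thus, 

$$\begin{aligned} 
\|z_i^{(j)} - (R^+)^j e_{s_i}\|_R &\le \|(R^+)^{j-1} r_i^{(1)}\|_R + \dots + \|R^+ r_i^{(j-1)}\|_R + \|r_i^{(j)}\|_R \\ 
&\le \|R^+\|_2^{j-1} \|r_i^{(1)}\|_R + \dots + \|R^+\|_2 \|r_i^{(j-1)}\|_R + \|r_i^{(j)}\|_R. \qquad (16) 
\end{aligned}$$ 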



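Now note, by repeatedly applying (15), 

$$\begin{aligned} 
\|r_i^{(k)}\|_R &\le \epsilon \, \|R^+ z_i^{(k-1)}\|_R \\ 
&\le \epsilon \, \|R^+\|_2 \, \|z_i^{(k-1)}\|_R \\ 
&\le \epsilon \, \|R^+\|_2 \left( \|R^+ z_i^{(k-2)}\|_R + \|r_i^{(k-1)}\|_R \right) \\ 
&\le \epsilon \, \|R^+\|_2 \left( \|R^+ z_i^{(k-2)}\|_R + \epsilon \, \|R^+ z_i^{(k-2)}\|_R \right) \\ 
&= \epsilon \, \|R^+\|_2 \, \|R^+ z_i^{(k-2)}\|_R + \text{h.o.t.} \\ 
&\le \dots \\ 
&\le \epsilon \, \|R^+\|_2^{k-1} \, \|R^+ e_{s_i}\|_R + \text{h.o.t.} \\ 
&\le \epsilon \, \|R^+\|_2^{k-1/2} + \text{h.o.t.}, 
\end{aligned}$$ 

where the last step uses $\|R^+ e_{s_i}\|_R = \sqrt{e_{s_i}^\top R^+ e_{s_i}} \le \|R^+\|_2^{1/2}$, 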



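and so plugging this into (16) gives, 

$$\|z_i^{(j)} - (R^+)^j e_{s_i}\|_R \le j \epsilon \, \|R^+\|_2^{j-1/2} + \text{h.o.t.},$$ 

and, since $\|v\|_{R^j} \le \|R\|_2^{(j-1)/2} \|v\|_R$ for every $v$, 

$$\|z_i^{(j)} - (R^+)^j e_{s_i}\|_{R^j} \le j \epsilon \, \|R\|_2^{(j-1)/2} \|R^+\|_2^{j-1/2} + \text{h.o.t.}$$ 

In particular, taking $j = p$ and writing $Q = R^p$, 

$$\|z_i^{(p)} - Q^+ e_{s_i}\|_Q \le p \epsilon \, \|R\|_2^{(p-1)/2} \|R^+\|_2^{p-1/2} + \text{h.o.t.} \qquad (17)$$ 

Now let $Z := \left( z_1^{(p)}, \dots, z_{n_S}^{(p)} \right)$ and 

$$A := Z^\top Q Z = Z^\top R^p Z,$$ 

and note that $A$ can be computed with $O(p s n_S + n_S^2 n)$ operations since $R$ has $s$ non-zero entries. Now note that, 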



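$$\begin{aligned} 
|Q^+_{s_i s_j} - A_{ij}| &= \left| e_{s_i}^\top Q^+ e_{s_j} - z_i^{(p)\top} Q z_j^{(p)} \right| \\ 
&= \left| (Q^+ e_{s_i} - z_i^{(p)})^\top Q Q^+ e_{s_j} + (Q^+ e_{s_j} - z_j^{(p)})^\top Q Q^+ e_{s_i} - (Q^+ e_{s_i} - z_i^{(p)})^\top Q (Q^+ e_{s_j} - z_j^{(p)}) \right| \\ 
&\le \|Q^+ e_{s_i} - z_i^{(p)}\|_Q \sqrt{e_{s_j}^\top Q^+ e_{s_j}} + \|Q^+ e_{s_j} - z_j^{(p)}\|_Q \sqrt{e_{s_i}^\top Q^+ e_{s_i}} + \|Q^+ e_{s_i} - z_i^{(p)}\|_Q \, \|Q^+ e_{s_j} - z_j^{(p)}\|_Q \\ 
&\le 2 p \epsilon \, \|R\|_2^{(p-1)/2} \|R^+\|_2^{(3p-1)/2} + \text{h.o.t.}, 
\end{aligned}$$ 

which, after setting $\epsilon'$ such that $\epsilon = 2 p \epsilon' \|R\|_2^{(p-1)/2} \|R^+\|_2^{(3p-1)/2}$ and running the above with $\epsilon'$ in place of $\epsilon$, gives in time complexity 

$$\begin{aligned} 
&O\left( p n_S s \log n (\log\log n)^2 \log\tfrac{1}{\epsilon'} + p s n_S + n_S^2 n \right) \\ 
&\quad = O\left( p n_S s \log n (\log\log n)^2 \left( \log p + p \log \|R\|_2 \|R^+\|_2 + \log\tfrac{1}{\epsilon} \right) + n_S^2 n \right) 
\end{aligned}$$ 

the guarantee, 

$$|Q^+_{s_i s_j} - A_{ij}| \le \epsilon + \text{h.o.t.}$$ 

□ 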
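The construction in this proof can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' implementation: plain conjugate gradient stands in for the solver of Koutis et al., the matrix `R` is a small path-graph Laplacian shifted by `0.5 * I` so that it is positive definite (hence $R^+ = R^{-1}$), and the index set `S`, the helper names and the dense list-of-lists storage are all hypothetical choices.

```python
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_solve(R, b, eps):
    """Conjugate gradient, stopped once the residual norm is below eps * ||b||."""
    x = [0.0] * len(b)
    r = list(b)
    d = list(b)
    rr = dot(r, r)
    target = (eps ** 2) * rr
    for _ in range(10 * len(b)):
        if rr <= target:
            break
        Rd = matvec(R, d)
        alpha = rr / dot(d, Rd)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * rdi for ri, rdi in zip(r, Rd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return x

def exact_solve(R, b):
    """Gaussian elimination with partial pivoting, used only as a reference."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(R, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for k in range(c + 1, n):
            f = M[k][c] / M[c][c]
            for j in range(c, n + 1):
                M[k][j] -= f * M[c][j]
    x = [0.0] * n
    for c in reversed(range(n)):
        x[c] = (M[c][n] - sum(M[c][j] * x[j] for j in range(c + 1, n))) / M[c][c]
    return x

def iterate(solver, R, e, p):
    """Apply the solver p times: z^(j) (approximately) solves R z^(j) = z^(j-1)."""
    z = e
    for _ in range(p):
        z = solver(R, z)
    return z

def power_apply(R, v, p):
    """Return R^p v using p matrix-vector products."""
    for _ in range(p):
        v = matvec(R, v)
    return v

def unit(k, n):
    return [1.0 if t == k else 0.0 for t in range(n)]

n, p, eps = 8, 2, 1e-8
# Toy regulariser: path-graph Laplacian plus 0.5 * I (assumed, so that R is invertible).
R = [[0.0] * n for _ in range(n)]
for i in range(n):
    R[i][i] = 0.5 + (1 if i > 0 else 0) + (1 if i < n - 1 else 0)
    if i > 0:
        R[i][i - 1] = -1.0
    if i < n - 1:
        R[i][i + 1] = -1.0

S = [0, 3, 6]  # hypothetical indices s_i of the basis points

# Columns z_i^(p), then A = Z^T R^p Z.
Z = [iterate(lambda M, b: cg_solve(M, b, eps), R, unit(s, n), p) for s in S]
A = [[dot(zi, power_apply(R, zj, p)) for zj in Z] for zi in Z]

# Reference entries of (R^{-1})^p restricted to the basis points.
exact_cols = [iterate(exact_solve, R, unit(s, n), p) for s in S]
Q_plus_SS = [[exact_cols[b][S[a]] for b in range(len(S))] for a in range(len(S))]
max_err = max(abs(A[a][b] - Q_plus_SS[a][b])
              for a in range(len(S)) for b in range(len(S)))
```

Only $p \cdot |S|$ linear solves and a handful of matrix-vector products are needed, and `max_err` measures how closely the resulting `A` matches the corresponding entries of the exact inverse power.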